{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inference PyTorch Bert Model with ONNX Runtime on CPU"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, you'll be introduced to how to load a Bert model from PyTorch, convert it to ONNX, and inference it for high performance using ONNX Runtime. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.\n",
    "\n",
    "This notebook is for CPU inference. For GPU inferenece, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on GPU](PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Prerequisites ##\n",
    "\n",
    "If you have Jupyter Notebook, you may directly run this notebook. We will use pip to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/) and other required packages.\n",
    "\n",
    "Otherwise, you can setup a new environment. First, we install [AnaConda](https://www.anaconda.com/distribution/). Then open an AnaConda prompt window and run the following commands:\n",
    "\n",
    "```console\n",
    "conda create -n cpu_env python=3.10\n",
    "conda activate cpu_env\n",
    "pip install jupyterlab\n",
    "conda install ipykernel\n",
    "conda install -c conda-forge ipywidgets\n",
    "ipython kernel install --user --name cpu_env\n",
    "jupyter-lababook\n",
    "```\n",
    "The last command will launch Jupyter Notebook and we can open this notebook in browser to continue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Please refer to https://pytorch.org to install CPU-only PyTorch\n",
    "import sys\n",
    "if sys.platform in [\"darwin\", \"win32\"]:  # Mac or Windows\n",
    "    !{sys.executable} -m pip install torch -q\n",
    "else:\n",
    "    !{sys.executable} -m pip install install torch --index-url https://download.pytorch.org/whl/cpu -q\n",
    "\n",
    "# Install other packages used in this notebook.\n",
    "!{sys.executable} -m pip install onnxruntime onnx transformers==4.18 psutil pandas py-cpuinfo py3nvml wget netron -q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load Pretrained Bert model ##"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We begin by downloading the SQuAD data file and store them in the specified location."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Start downloading predict file.\n",
      "100% [..........................................................................] 4854279 / 4854279Predict file downloaded.\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "cache_dir = os.path.join(\".\", \"cache_models\")\n",
    "if not os.path.exists(cache_dir):\n",
    "    os.makedirs(cache_dir)\n",
    "\n",
    "predict_file_url = \"https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\"\n",
    "predict_file = os.path.join(cache_dir, \"dev-v1.1.json\")\n",
    "if not os.path.exists(predict_file):\n",
    "    import wget\n",
    "    print(\"Start downloading predict file.\")\n",
    "    wget.download(predict_file_url, predict_file)\n",
    "    print(\"Predict file downloaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specify some model configuration variables and constant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For fine tuned large model, the model name is \"bert-large-uncased-whole-word-masking-finetuned-squad\". Here we use bert-base for demo.\n",
    "model_name_or_path = \"bert-base-cased\"\n",
    "max_seq_length = 128\n",
    "doc_stride = 128\n",
    "max_query_length = 64\n",
    "\n",
    "# Enable overwrite to export onnx model and download latest script each time when running this notebook.\n",
    "enable_overwrite = True\n",
    "\n",
    "# Total samples to inference. It shall be large enough to get stable latency measurement.\n",
    "total_samples = 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start to load model from pretrained. This step could take a few minutes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2d13e6ef5b5840d8b58930b908b1a0c8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "74cc5552b28c4ca4957b04e03a8d8db1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f9f3c5c1a8494d048c0ead297812907c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "377f46f6e3e64e628514eec9a01d54e2",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
      "- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
      "Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
      "100%|██████████████████████████████████████████████████████████████████████████████████| 48/48 [00:04<00:00, 11.48it/s]\n",
      "convert squad examples to features: 100%|███████████████████████████████████████████| 100/100 [00:00<00:00, 129.66it/s]\n",
      "add example index and unique id: 100%|███████████████████████████████████████████████████████| 100/100 [00:00<?, ?it/s]\n"
     ]
    }
   ],
   "source": [
    "# The following code is adapted from HuggingFace transformers\n",
    "# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py\n",
    "\n",
    "from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)\n",
    "\n",
    "# Load pretrained model and tokenizer\n",
    "config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)\n",
    "config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)\n",
    "tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)\n",
    "model = model_class.from_pretrained(model_name_or_path,\n",
    "                                    from_tf=False,\n",
    "                                    config=config,\n",
    "                                    cache_dir=cache_dir)\n",
    "# load some examples\n",
    "from transformers.data.processors.squad import SquadV1Processor\n",
    "\n",
    "processor = SquadV1Processor()\n",
    "examples = processor.get_dev_examples(None, filename=predict_file)\n",
    "\n",
    "from transformers import squad_convert_examples_to_features\n",
    "features, dataset = squad_convert_examples_to_features( \n",
    "            examples=examples[:total_samples], # convert just enough examples for this notebook\n",
    "            tokenizer=tokenizer,\n",
    "            max_seq_length=max_seq_length,\n",
    "            doc_stride=doc_stride,\n",
    "            max_query_length=max_query_length,\n",
    "            is_training=False,\n",
    "            return_dataset='pt'\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Export the loaded model ##\n",
    "Once the model is loaded, we can export the loaded PyTorch model to ONNX."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============== Diagnostic Run torch.onnx.export version 2.0.1+cpu ==============\n",
      "verbose: False, log level: Level.ERROR\n",
      "======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================\n",
      "\n",
      "Model exported at  ..\\onnx_models\\bert-base-cased-squad.onnx\n"
     ]
    }
   ],
   "source": [
    "output_dir = os.path.join(\".\", \"onnx_models\")\n",
    "if not os.path.exists(output_dir):\n",
    "    os.makedirs(output_dir)   \n",
    "export_model_path = os.path.join(output_dir, 'bert-base-cased-squad.onnx')\n",
    "\n",
    "import torch\n",
    "device = torch.device(\"cpu\")\n",
    "\n",
    "# Get the first example data to run the model and export it to ONNX\n",
    "data = dataset[0]\n",
    "inputs = {\n",
    "    'input_ids':      data[0].to(device).reshape(1, max_seq_length),\n",
    "    'attention_mask': data[1].to(device).reshape(1, max_seq_length),\n",
    "    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\n",
    "}\n",
    "\n",
    "# Set model to inference mode, which is required before exporting the model because some operators behave differently in \n",
    "# inference and training mode.\n",
    "model.eval()\n",
    "model.to(device)\n",
    "\n",
    "if enable_overwrite or not os.path.exists(export_model_path):\n",
    "    with torch.no_grad():\n",
    "        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\n",
    "        torch.onnx.export(model,                                            # model being run\n",
    "                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)\n",
    "                          f=export_model_path,                              # where to save the model (can be a file or file-like object)\n",
    "                          opset_version=11,                                 # the ONNX version to export the model to\n",
    "                          do_constant_folding=True,                         # whether to execute constant folding for optimization\n",
    "                          input_names=['input_ids',                         # the model's input names\n",
    "                                       'input_mask', \n",
    "                                       'segment_ids'],\n",
    "                          output_names=['start', 'end'],                    # the model's output names\n",
    "                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes\n",
    "                                        'input_mask' : symbolic_names,\n",
    "                                        'segment_ids' : symbolic_names,\n",
    "                                        'start' : symbolic_names,\n",
    "                                        'end' : symbolic_names})\n",
    "        print(\"Model exported at \", export_model_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. PyTorch Inference ##\n",
    "Use PyTorch to evaluate an example input for comparison purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PyTorch cpu Inference time = 178.87 ms\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.\n",
    "latency = []\n",
    "with torch.no_grad():\n",
    "    for i in range(total_samples):\n",
    "        data = dataset[i]\n",
    "        inputs = {\n",
    "            'input_ids':      data[0].to(device).reshape(1, max_seq_length),\n",
    "            'attention_mask': data[1].to(device).reshape(1, max_seq_length),\n",
    "            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\n",
    "        }\n",
    "        start = time.time()\n",
    "        outputs = model(**inputs)\n",
    "        latency.append(time.time() - start)\n",
    "print(\"PyTorch {} Inference time = {} ms\".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Inference ONNX Model with ONNX Runtime ##\n",
    "\n",
    "For Onnx Runtime 1.6.0 or older, OpenMP environment variables are very important for CPU inference of Bert model. Since 1.7.0, the official package is not built with OpenMP."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we inference the model with ONNX Runtime. Here we can see that OnnxRuntime has better performance than PyTorch. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "OnnxRuntime cpu Inference time = 84.11 ms\n"
     ]
    }
   ],
   "source": [
    "import onnxruntime\n",
    "import numpy\n",
    "\n",
    "sess_options = onnxruntime.SessionOptions()\n",
    "\n",
    "# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.\n",
    "# Note that this will increase session creation time, so it is for debugging only.\n",
    "sess_options.optimized_model_filepath = os.path.join(output_dir, \"optimized_model_cpu.onnx\")\n",
    "\n",
    "# Specify providers when you use onnxruntime-gpu for CPU inference.\n",
    "session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['CPUExecutionProvider'])\n",
    "\n",
    "latency = []\n",
    "for i in range(total_samples):\n",
    "    data = dataset[i]\n",
    "    ort_inputs = {\n",
    "        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),\n",
    "        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),\n",
    "        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()\n",
    "    }\n",
    "    start = time.time()\n",
    "    ort_outputs = session.run(None, ort_inputs)\n",
    "    latency.append(time.time() - start)\n",
    "print(\"OnnxRuntime cpu Inference time = {} ms\".format(format(sum(latency) * 1000 / len(latency), '.2f')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "***** Verifying correctness *****\n",
      "PyTorch and ONNX Runtime output 0 are close: True\n",
      "PyTorch and ONNX Runtime output 1 are close: True\n"
     ]
    }
   ],
   "source": [
    "print(\"***** Verifying correctness *****\")\n",
    "for i in range(2):\n",
    "    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-05, atol=1e-04))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Offline Optimization Script and Test Tools\n",
    "\n",
    "It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Transformer Optimizer\n",
    "\n",
    "Although OnnxRuntime could optimize Bert model exported by PyTorch. Sometime, model cannot be fully optimized due to different reasons:\n",
    "* A new subgraph pattern is generated by new version of export tool, and the pattern is not covered by older version of OnnxRuntime. \n",
    "* The exported model uses dynamic axis and this makes it harder for shape inference of the graph. That blocks some optimization to be applied.\n",
    "* Some optimization is better to be done offline. Like change input tensor type from int64 to int32 to avoid extra Cast nodes, or convert model to float16 to achieve better performance in V100 or T4 GPU.\n",
    "\n",
    "We have python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.\n",
    "\n",
    "In this example, we can see that it introduces optimization that is not provided by onnxruntime: SkipLayerNormalization and bias fusion, which is not fused in OnnxRuntime due to shape inference as mentioned.\n",
    "\n",
    "It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph.\n",
    "\n",
    "Example Usage:\n",
    "```\n",
    "from onnxruntime.transformers import optimizer\n",
    "optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)\n",
    "optimized_model.save_model_to_file(optimized_model_path)\n",
    "```\n",
    "\n",
    "You can also use command line like the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "               apply: Fused LayerNormalization: 25\n",
      "               apply: Fused Gelu: 12\n",
      "               apply: Fused SkipLayerNormalization: 24\n",
      "               apply: Fused Attention: 12\n",
      "         prune_graph: Removed 5 nodes\n",
      "               apply: Fused EmbedLayerNormalization(with mask): 1\n",
      "         prune_graph: Removed 10 nodes\n",
      "               apply: Fused BiasGelu: 12\n",
      "               apply: Fused SkipLayerNormalization(add bias): 24\n",
      "            optimize: opset version: 11\n",
      "get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'MultiHeadAttention': 0, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'GemmFastGelu': 0, 'LayerNormalization': 0, 'SkipLayerNormalization': 24, 'QOrderedAttention': 0, 'QOrderedGelu': 0, 'QOrderedLayerNormalization': 0, 'QOrderedMatMul': 0}\n",
      "                main: The model has been fully optimized.\n",
      "  save_model_to_file: Sort graphs in topological order\n",
      "  save_model_to_file: Model saved to ..\\onnx_models\\bert-base-cased-squad_opt_cpu.onnx\n"
     ]
    }
   ],
   "source": [
    "optimized_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opt_cpu.onnx')\n",
    "\n",
    "!{sys.executable} -m onnxruntime.transformers.optimizer --input $export_model_path --output $optimized_model_path --model_type bert"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Optimized Graph\n",
    "When you can open the optimized model using Netron to visualize, the graph is like the following:\n",
    "<img src='images/optimized_bert_gpu.png'>\n",
    "\n",
    "For CPU, optimized graph is slightly different: FastGelu is replaced by BiasGelu."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Serving '..\\onnx_models\\bert-base-cased-squad_opt_cpu.onnx' at http://localhost:8080\n"
     ]
    }
   ],
   "source": [
    "import netron\n",
    "\n",
    "# Change it to False to skip viewing the optimized model in browser.\n",
    "enable_netron = True\n",
    "if enable_netron:\n",
    "    # If you encounter error \"access a socket in a way forbidden by its access permissions\", install Netron as standalone application instead.\n",
    "    netron.start(optimized_model_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Model Results Comparison Tool\n",
    "\n",
    "If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare results from both the original and optimized models. If outputs are all close, it is safe to use the optimized model.\n",
    "\n",
    "Example of verifying models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100% passed for 100 random inputs given thresholds (rtol=0.001, atol=0.0001).\n",
      "maximum absolute difference=2.771615982055664e-06\n",
      "maximum relative difference=0.017596205696463585\n"
     ]
    }
   ],
   "source": [
    "!{sys.executable} -m onnxruntime.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_model_path --batch_size 1 --sequence_length 128 --samples 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Performance Test Tool\n",
    "\n",
    "This tool measures performance of BERT model inference using OnnxRuntime Python API.\n",
    "\n",
    "The following command will create 100 samples of batch_size 1 and sequence length 128 to run inference, then calculate performance numbers like average latency and throughput etc. You can increase number of samples (recommended 1000) to get more stable result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 52.26 ms, Throughput = 19.14 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=11,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 52.67 ms, Throughput = 18.99 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=10,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 54.57 ms, Throughput = 18.32 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=9,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 63.12 ms, Throughput = 15.84 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=8,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 59.80 ms, Throughput = 16.72 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=7,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 64.72 ms, Throughput = 15.45 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=6,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 58.62 ms, Throughput = 17.06 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=5,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 73.72 ms, Throughput = 13.56 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=4,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 88.76 ms, Throughput = 11.27 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=3,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 91.14 ms, Throughput = 10.97 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=2,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 119.80 ms, Throughput = 8.35 QPS\n",
      "Running test: model=bert-base-cased-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
      "Average latency = 225.06 ms, Throughput = 4.44 QPS\n",
      "test setting TestSetting(batch_size=1, sequence_length=128, test_cases=100, test_times=1, use_gpu=False, use_io_binding=False, provider=None, intra_op_num_threads=None, seed=3, verbose=False, log_severity=2)\n",
      "Generating 100 samples for batch_size=1 sequence_length=128\n",
      "Test summary is saved to ..\\onnx_models\\perf_results_CPU_B1_S128_20230805-202517.txt\n"
     ]
    }
   ],
   "source": [
    "!{sys.executable} -m onnxruntime.transformers.bert_perf_test --model $optimized_model_path --batch_size 1 --sequence_length 128 --samples 100 --test_times 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the summary file and take a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "..\\onnx_models\\perf_results_CPU_B1_S128_20230805-202517.txt\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Latency(ms)</th>\n",
       "      <th>Latency_P75</th>\n",
       "      <th>Latency_P90</th>\n",
       "      <th>Latency_P99</th>\n",
       "      <th>Throughput(QPS)</th>\n",
       "      <th>intra_op_num_threads</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>52.26</td>\n",
       "      <td>52.92</td>\n",
       "      <td>60.94</td>\n",
       "      <td>88.51</td>\n",
       "      <td>19.14</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>52.67</td>\n",
       "      <td>52.73</td>\n",
       "      <td>54.06</td>\n",
       "      <td>60.48</td>\n",
       "      <td>18.99</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>54.57</td>\n",
       "      <td>55.05</td>\n",
       "      <td>56.52</td>\n",
       "      <td>59.62</td>\n",
       "      <td>18.32</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>58.62</td>\n",
       "      <td>65.86</td>\n",
       "      <td>71.94</td>\n",
       "      <td>79.38</td>\n",
       "      <td>17.06</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>59.80</td>\n",
       "      <td>61.09</td>\n",
       "      <td>65.72</td>\n",
       "      <td>72.79</td>\n",
       "      <td>16.72</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>63.12</td>\n",
       "      <td>61.92</td>\n",
       "      <td>77.47</td>\n",
       "      <td>129.26</td>\n",
       "      <td>15.84</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>64.72</td>\n",
       "      <td>65.94</td>\n",
       "      <td>68.54</td>\n",
       "      <td>75.00</td>\n",
       "      <td>15.45</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>73.72</td>\n",
       "      <td>80.43</td>\n",
       "      <td>86.39</td>\n",
       "      <td>92.95</td>\n",
       "      <td>13.56</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>88.76</td>\n",
       "      <td>96.30</td>\n",
       "      <td>101.13</td>\n",
       "      <td>109.78</td>\n",
       "      <td>11.27</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>91.14</td>\n",
       "      <td>99.14</td>\n",
       "      <td>104.41</td>\n",
       "      <td>110.69</td>\n",
       "      <td>10.97</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>119.80</td>\n",
       "      <td>119.78</td>\n",
       "      <td>123.66</td>\n",
       "      <td>130.64</td>\n",
       "      <td>8.35</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>225.06</td>\n",
       "      <td>227.11</td>\n",
       "      <td>229.45</td>\n",
       "      <td>252.40</td>\n",
       "      <td>4.44</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Latency(ms)  Latency_P75  Latency_P90  Latency_P99  Throughput(QPS)  \\\n",
       "0         52.26        52.92        60.94        88.51            19.14   \n",
       "1         52.67        52.73        54.06        60.48            18.99   \n",
       "2         54.57        55.05        56.52        59.62            18.32   \n",
       "3         58.62        65.86        71.94        79.38            17.06   \n",
       "4         59.80        61.09        65.72        72.79            16.72   \n",
       "5         63.12        61.92        77.47       129.26            15.84   \n",
       "6         64.72        65.94        68.54        75.00            15.45   \n",
       "7         73.72        80.43        86.39        92.95            13.56   \n",
       "8         88.76        96.30       101.13       109.78            11.27   \n",
       "9         91.14        99.14       104.41       110.69            10.97   \n",
       "10       119.80       119.78       123.66       130.64             8.35   \n",
       "11       225.06       227.11       229.45       252.40             4.44   \n",
       "\n",
       "    intra_op_num_threads  \n",
       "0                     12  \n",
       "1                     11  \n",
       "2                     10  \n",
       "3                      6  \n",
       "4                      8  \n",
       "5                      9  \n",
       "6                      7  \n",
       "7                      5  \n",
       "8                      4  \n",
       "9                      3  \n",
       "10                     2  \n",
       "11                     1  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import glob     \n",
    "import pandas\n",
    "latest_result_file = max(glob.glob(os.path.join(output_dir, \"perf_results_*.txt\")), key=os.path.getmtime)\n",
    "result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\n",
    "print(latest_result_file)\n",
    "\n",
    "# Remove some columns that have same values for all rows.\n",
    "columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\n",
    "# Hide some latency percentile columns to fit screen width.\n",
    "columns_to_remove.extend(['Latency_P50', 'Latency_P95'])\n",
    "result_data.drop(columns_to_remove, axis=1, inplace=True)\n",
    "result_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Quantization\n",
    "\n",
    "Please see example in [BERT quantization notebook](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/notebooks/bert/Bert-GLUE_OnnxRuntime_quantization.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Additional Info\n",
    "\n",
    "Note that running Jupyter Notebook has slight impact on performance result since Jupyter Notebook is using system resources like CPU and memory etc. It is recommended to close Jupyter Notebook and other applications, then run the performance test tool in a console to get more accurate performance numbers.\n",
    "\n",
    "We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it compare inference speed of OnnxRuntime with PyTorch.\n",
    "\n",
    "[OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/main/docs/C_API.md) could get slightly better performance than python API. If you use C API in inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.\n",
    "\n",
    "Here is the machine configuration that generated the above results. The machine has GPU but not used in CPU inference.\n",
    "You might get slower or faster result based on your hardware."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"gpu\": {\n",
      "    \"driver_version\": \"472.88\",\n",
      "    \"devices\": [\n",
      "      {\n",
      "        \"memory_total\": 12884901888,\n",
      "        \"memory_available\": 12732858368,\n",
      "        \"name\": \"NVIDIA GeForce RTX 3060\"\n",
      "      }\n",
      "    ]\n",
      "  },\n",
      "  \"cpu\": {\n",
      "    \"brand\": \"Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz\",\n",
      "    \"cores\": 6,\n",
      "    \"logical_cores\": 12,\n",
      "    \"hz\": \"3192000000,0\",\n",
      "    \"l2_cache\": 1572864,\n",
      "    \"flags\": \"3dnow,3dnowprefetch,abm,acpi,adx,aes,apic,avx,avx2,bmi1,bmi2,clflush,clflushopt,cmov,cx16,cx8,de,dtes64,dts,erms,est,f16c,fma,fpu,fxsr,hle,ht,hypervisor,ia64,invpcid,lahf_lm,mca,mce,mmx,monitor,movbe,mpx,msr,mtrr,osxsave,pae,pat,pbe,pcid,pclmulqdq,pdcm,pge,pni,popcnt,pse,pse36,rdrnd,rdseed,rtm,sep,serial,sgx,sgx_lc,smap,smep,ss,sse,sse2,sse4_1,sse4_2,ssse3,tm,tm2,tsc,tscdeadline,vme,x2apic,xsave,xtpr\",\n",
      "    \"processor\": \"Intel64 Family 6 Model 158 Stepping 10, GenuineIntel\"\n",
      "  },\n",
      "  \"memory\": {\n",
      "    \"total\": 16977195008,\n",
      "    \"available\": 5000630272\n",
      "  },\n",
      "  \"os\": \"Windows-10-10.0.22621-SP0\",\n",
      "  \"python\": \"3.10.12.final.0 (64 bit)\",\n",
      "  \"packages\": {\n",
      "    \"flatbuffers\": \"23.5.26\",\n",
      "    \"numpy\": \"1.25.2\",\n",
      "    \"onnx\": \"1.14.0\",\n",
      "    \"onnxruntime\": \"1.15.1\",\n",
      "    \"protobuf\": \"4.23.4\",\n",
      "    \"sympy\": \"1.12\",\n",
      "    \"torch\": \"2.0.1\",\n",
      "    \"transformers\": \"4.18.0\"\n",
      "  },\n",
      "  \"onnxruntime\": {\n",
      "    \"version\": \"1.15.1\",\n",
      "    \"support_gpu\": false\n",
      "  },\n",
      "  \"pytorch\": {\n",
      "    \"version\": \"2.0.1+cpu\",\n",
      "    \"support_gpu\": false,\n",
      "    \"cuda\": null\n",
      "  },\n",
      "  \"tensorflow\": null\n",
      "}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "f:\\anaconda3\\envs\\cpu_env\\lib\\site-packages\\onnxruntime\\transformers\\machine_info.py:127: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html\n",
      "  import pkg_resources\n"
     ]
    }
   ],
   "source": [
    "!{sys.executable} -m onnxruntime.transformers.machine_info --silent"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cpu_env",
   "language": "python",
   "name": "cpu_env"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
